These presentation materials are part of a continuously updated tutorial on Python for High Performance Computing. To get the presentation materials as they were delivered today, please use the SC2012 tag on github. Future versions of this presentation will be found at:
https://github.com/pyhpc/pyhpc-tutorial
Please set all permanent bookmarks to this URL.
1) Download a zip or tarball from the GitHub SC2012 tag:
wget --no-check-certificate https://github.com/pyhpc/pyhpc-tutorial/zipball/SC2012
wget --no-check-certificate https://github.com/pyhpc/pyhpc-tutorial/tarball/SC2012
2) Check out from git:
git clone https://github.com/pyhpc/pyhpc-tutorial.git
This tutorial is an interactive worksheet designed to encourage you to try out the lessons during the demonstration. If you are looking at the PDF version of these slides, we encourage you to download the updated version (see previous slide) and try the interactive version.
To run the interactive version of this notebook, you will need a Python 2.7 environment including IPython 0.13 (with notebook and parallel support), NumPy, matplotlib, and mpi4py (plus an MPI implementation).
Unpack the tarball, change into the resulting directory, and execute:
ipcluster start --engines=MPI -n 4
ipython notebook --pylab=inline
If you are installing the packages yourself, Continuum Analytics (disclaimer: Travis Oliphant is the CEO of this company) provides both free community and professional versions of the Anaconda installer, which includes all the packages you will need for this portion of the tutorial. The installer is available from the Anaconda download page at Continuum Analytics.
You are also welcome to use Enthought's Python Distribution (free for Academic users), available from the EPD download page at Enthought.
Unfortunately, if you go this route you will need to upgrade EPD's IPython to 0.13, which may be non-trivial; please see the Enthought discussion thread here.
The slideshow mode is only supported by an IPython development branch version. If you would like to write your own slideshows (the notebooks will work regardless of slideshow capabilities), you will need to install Matthias Bussonnier's slideshow_metadata branch. Here are some sample command-line instructions.
#!/bin/bash
git clone git://github.com/Carreau/ipython.git # Matthias' IPython fork
cd ipython # Enter the cloned repository
git checkout origin/slideshow_metadata # Select the right branch
python setup.py develop # Install the development version
ipython notebook # Check out the slideshows!
In [1]:
# The command below enables pylab inline mode, which simplifies the syntax for many common
# scientific computing commands. Lines starting with % are IPython magic commands.
%pylab inline
# The following two lines load the slideshow extension and start it;
# you only need to run them if you are writing or running your own slideshow.
%load_ext slidemode
%slideshow_mode
In [2]:
from IPython.parallel import Client
c = Client()
view = c[:]
%load_ext parallelmagic
view.activate()
view.block = True
This part of the tutorial focuses on data and collective parallelism
Most modern supercomputing architectures are composed of distributed-memory compute nodes connected by high-performance interconnects (often toroidal networks)
We expect to see higher-dimensional interconnects and power-aware networks that consume less power when not sending traffic
Many programming paradigms are tightly coupled to the hardware beneath them!
CUDA assumes large register files, Single Instruction Multiple Thread (SIMT) parallelism, and a mostly flat, structured memory model, matching the underlying GPU hardware
OpenMP exposes loop-level parallelism with a fork/join model and assumes the presence of shared memory and atomics
OpenCL tries to generalize CUDA, but still assumes a 'coprocessor' approach in which kernels are shipped from a master core to worker cores
Interprocess communication consists of synchronization and the movement of data from one process's address space to another's
It Scales!
Two initial communicators:
MPI_COMM_WORLD (all processes)
MPI_COMM_SELF (current process)
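As a quick sketch (not part of the original notebook) of the difference between the two, to be run on the MPI-enabled engines (e.g. under the %autopx magic introduced below):
from mpi4py import MPI
world = MPI.COMM_WORLD        # communicator spanning every process in the job
self_comm = MPI.COMM_SELF     # communicator containing only the calling process
print("COMM_WORLD size: %d, COMM_SELF size: %d" % (world.Get_size(), self_comm.Get_size()))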
For interactive convenience, we load the parallel magic extensions and make this view the active one for the automatic parallelism magics.
This is not strictly necessary, and production codes likely won't use it, as the engines will load their own MPI code separately. But it makes it easy to illustrate everything from within a single notebook.
In [3]:
from IPython.parallel import Client
c = Client()
view = c[:]
%load_ext parallelmagic
view.activate()
view.block = True
Use the %autopx magic to make subsequent cells execute on the engines instead of locally.
In [4]:
%autopx
With autopx enabled, the next cell will actually execute entirely on each engine:
In [5]:
from mpi4py import MPI
size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()
print("Hello World! I am process %d of %d on %s.\n" % (rank, size, name))
Two initial communicators:
MPI_COMM_WORLD (all processes)
MPI_COMM_SELF (current process)
A datatype is recursively defined, starting from predefined types that correspond to basic language types (e.g., MPI_INT, MPI_DOUBLE); there are MPI functions to construct custom datatypes
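As a small sketch of building a custom datatype in mpi4py (the same Create_contiguous call reappears in the Mandelbrot example later in this notebook):
from mpi4py import MPI
# Treat a contiguous block of 4 ints as a single element
blocktype = MPI.INT.Create_contiguous(4)
blocktype.Commit()    # a derived datatype must be committed before use
# ... blocktype can now be used in Send/Recv/Gatherv calls ...
blocktype.Free()      # release the datatype when done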
The IPython engines are started with MPI enabled via the ipcluster command (see the command at the start of this notebook).
In [6]:
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = {'a': 7, 'b': 3.14}
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=0, tag=11)
    print data
In [7]:
from mpi4py import MPI
import numpy
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
# pass explicit MPI datatypes
if rank == 0:
    data = numpy.arange(1000, dtype='i')
    comm.Send([data, MPI.INT], dest=1, tag=77)
elif rank == 1:
    data = numpy.empty(1000, dtype='i')
    comm.Recv([data, MPI.INT], source=0, tag=77)
# or take advantage of automatic MPI datatype discovery
if rank == 0:
    data = numpy.arange(100, dtype=numpy.float64)
    comm.Send(data, dest=1, tag=13)
elif rank == 1:
    data = numpy.empty(100, dtype=numpy.float64)
    comm.Recv(data, source=0, tag=13)
    print data
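The buffer interface also has non-blocking counterparts (Isend/Irecv). A minimal sketch, not part of the original notebook, assuming the same two (or more) engines and an arbitrary tag of 42:
from mpi4py import MPI
import numpy
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = numpy.arange(100, dtype=numpy.float64)
    req = comm.Isend(data, dest=1, tag=42)
    req.Wait()               # the send buffer may only be reused after Wait()
elif rank == 1:
    data = numpy.empty(100, dtype=numpy.float64)
    req = comm.Irecv(data, source=0, tag=42)
    req.Wait()               # block until the message has actually arrived
    print data[:5]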
In [8]:
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
for r_id in range(comm.Get_size()):
    if rank == r_id:
        print "Hello from proc:", rank
    comm.Barrier()
In [9]:
t1 = MPI.Wtime()
t2 = MPI.Wtime()
print("time elapsed is: %e\n" % (t2-t1))
In [10]:
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = {'key1' : [7, 2.72, 2+3j],
            'key2' : ( 'abc', 'xyz')}
else:
    data = None
data = comm.bcast(data, root=0)
print "bcast finished and data \
on rank %d is: "%comm.rank, data
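The lowercase bcast pickles arbitrary Python objects; for numpy arrays the buffer-based Bcast avoids that overhead. A sketch (assumptions: the same MPI engines, an integer array of length 10):
from mpi4py import MPI
import numpy
comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    buf = numpy.arange(10, dtype='i')
else:
    buf = numpy.empty(10, dtype='i')   # receivers allocate matching storage
comm.Bcast(buf, root=0)                # broadcast directly into the buffer
print("rank %d has %s" % (comm.Get_rank(), buf))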
In [11]:
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
if rank == 0:
    data = [(i+1)**2 for i in range(size)]
else:
    data = None
data = comm.scatter(data, root=0)
assert data == (rank+1)**2
print "data on rank %d is: "%comm.rank, data
In [12]:
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
data = (rank+1)**2
print "before gather, data on \
rank %d is: "%rank, data
comm.Barrier()
data = comm.gather(data, root=0)
if rank == 0:
    for i in range(size):
        assert data[i] == (i+1)**2
else:
    assert data is None
print "data on rank: %d is: "%rank, data
In [13]:
from mpi4py import MPI
comm = MPI.COMM_WORLD
sendmsg = comm.rank
recvmsg1 = comm.reduce(sendmsg, op=MPI.SUM, root=0)
recvmsg2 = comm.allreduce(sendmsg)
print recvmsg2
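The buffer-based Allreduce works the same way on numpy arrays, and other reduction operators such as MPI.MAX are available. A sketch, not from the original notebook:
from mpi4py import MPI
import numpy
comm = MPI.COMM_WORLD
sendbuf = numpy.array([comm.Get_rank()], dtype='i')
recvbuf = numpy.empty(1, dtype='i')
comm.Allreduce(sendbuf, recvbuf, op=MPI.MAX)   # every rank ends up with the maximum rank
print("rank %d sees max rank %d" % (comm.Get_rank(), recvbuf[0]))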
In [14]:
from mpi4py import MPI
import math
def compute_pi(n, start=0, step=1):
    h = 1.0 / n
    s = 0.0
    for i in range(start, n, step):
        x = h * (i + 0.5)
        s += 4.0 / (1.0 + x**2)
    return s * h
comm = MPI.COMM_WORLD
nprocs = comm.Get_size()
myrank = comm.Get_rank()
if myrank == 0:
    n = 10
else:
    n = None
n = comm.bcast(n, root=0)
mypi = compute_pi(n, myrank, nprocs)
pi = comm.reduce(mypi, op=MPI.SUM, root=0)
if myrank == 0:
    error = abs(pi - math.pi)
    print("pi is approximately %.16f, error is %.16f" % (pi, error))
In [15]:
def mandelbrot(x, y, maxit):
    c = x + y*1j
    z = 0 + 0j
    it = 0
    while abs(z) < 2 and it < maxit:
        z = z**2 + c
        it += 1
    return it
x1, x2 = -2.0, 1.0
y1, y2 = -1.0, 1.0
w, h = 150, 100
maxit = 127
from mpi4py import MPI
import numpy
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
# number of rows to compute here
N = h // size + (h % size > rank)
# first row to compute here
start = comm.scan(N)-N
# array to store local result
Cl = numpy.zeros([N, w], dtype='i')
# compute owned rows
dx = (x2 - x1) / w
dy = (y2 - y1) / h
for i in range(N):
    y = y1 + (i + start) * dy
    for j in range(w):
        x = x1 + j * dx
        Cl[i, j] = mandelbrot(x, y, maxit)
# gather results at root (process 0)
counts = comm.gather(N, root=0)
C = None
if rank == 0:
    C = numpy.zeros([h, w], dtype='i')
rowtype = MPI.INT.Create_contiguous(w)
rowtype.Commit()
comm.Gatherv(sendbuf=[Cl, MPI.INT], recvbuf=[C, (counts, None), rowtype], root=0)
rowtype.Free()
We now need to "pull" the C array back for plotting, so we disable autopx. Make sure to re-enable it later on.
In [16]:
%autopx
In [17]:
# CC is an array of C from all ranks, so we use CC[0]
CC = view['C']
ranks = view['rank']
# Do the plotting
from matplotlib import pyplot
# Some magic to get to MPI4PY rank 0, not necessarily engine id 0
pyplot.imshow(CC[ranks.index(0)], aspect='equal')
pyplot.spectral()
pyplot.show()
Toggle autopx back on.
In [18]:
%autopx